An efficient algorithm for finding short approximate non-tandem repeats
Abstract
We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet Sigma and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected number. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length O(log N)) approximate repeats. The running time of our algorithm is O(D N^(3 pow(epsilon) - 1) log N), where epsilon = D/P and pow(epsilon) is an increasing, concave function that is 0 when epsilon = 0 and about 0.9 for DNA and protein sequences.
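To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the seed-based algorithm the abstract describes; it only illustrates what "a substring of length P that repeats with at most D differences" can mean. The function names, the `(i, j)` output format, the requirement that the second copy start after the first one ends, and the choice to let the second copy have any length from P - D to P + D are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, and mismatches."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or mismatch (cost 1)
            ))
        prev = curr
    return prev[-1]


def naive_approximate_repeats(s: str, p: int, d: int) -> list[tuple[int, int]]:
    """Brute-force baseline (hypothetical helper, not the paper's method).

    Returns pairs (i, j) such that the window s[i:i+p] is within edit
    distance d of some substring starting at j >= i + p.
    """
    hits = []
    n = len(s)
    for i in range(n - p + 1):
        window = s[i:i + p]
        for j in range(i + p, n):  # assumption: the two copies do not overlap
            # Indels can shrink or grow the second copy by up to d characters.
            for length in range(max(1, p - d), p + d + 1):
                if j + length > n:
                    break
                if edit_distance(window, s[j:j + length]) <= d:
                    hits.append((i, j))
                    break
    return hits


if __name__ == "__main__":
    print(naive_approximate_repeats("abcdxxabcd", 4, 0))
```

This baseline compares every window against every later start position, so its cost grows at least quadratically in N. For comparison, taking pow(epsilon) to be about 0.9 as stated in the abstract, the exponent 3 pow(epsilon) - 1 is about 3 * 0.9 - 1 = 1.7, which is what makes the stated bound sub-quadratic.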
Similar resources
Finding the Position of the k-Mismatch and Approximate Tandem Repeats
Given a pattern P, a text T, and an integer k, we want to find for every position j of T, the index of the k-mismatch of P with the suffix of T starting at position j. We give an algorithm that finds the exact index for each j, and algorithms that approximate it. We use these algorithms to get an efficient solution for an approximate version of the tandem repeats problem with k-mismatches.
Tandem repeats over the edit distance
MOTIVATION: A tandem repeat in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats occur in the genomes of both eukaryotic and prokaryotic organisms. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies. Although tandem repea...
An Algorithm for Approximate Tandem Repeats
A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e.g., abcabc. An approximate single tandem repeat is one in which the substrings are similar, but not identical, e.g., abcdaacd. In this paper we consider two criteria of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of lengt...
Editorial: Could Speciation Across Evolution be Governed by Genetic Switch Codes at Short Tandem Repeats?
Finding Approximate Tandem Repeats with the Burrows-Wheeler Transform
Approximate tandem repeats in a genomic sequence are two or more contiguous, similar copies of a pattern of nucleotides. They are used in DNA mapping, studying molecular evolution mechanisms, forensic analysis and research in diagnosis of inherited diseases. All their functions are still investigated and not well defined, but increasing biological databases together with tools for identificatio...
Journal title:
- Bioinformatics
Volume 17, Suppl 1
Publication date: 2001